00002 March 29 1973
00004
00006
00008
00010 Some Preliminary Experiments in Speech Recognition
00020 using Signature Table Learning
00030
00040 by
00050 R.B.Thosar and A.L.Samuel
00060
00070 A limited amount of success has been achieved in the
00080 application of the signature table scheme of machine
00090 learning to the problem of automatic speech recognition.
00100 The scheme is based on the assumption that the recognition
00110 system must eventually employ a learning mechanism and that
00120 the acoustic part of the system must start by dealing
00130 with the recognition of fairly elemental speech segments
00140 rather than with words if it is to have general utility.
00150
00160
00170
00180 This paper describes the general philosophy that is being
00190 followed and some early results that have been obtained in an attempt
00200 to devise elements of a speech recognition system that is not
00210 dependent upon the use of a limited vocabulary and that can recognize
00220 continuous speech by a number of different speakers.
00230
00240 Such a system should be able to function successfully either
00250 without any previous training for the specific speaker in question or
00260 after a short training session in which the speaker would be asked to
00270 repeat certain phrases designed to train the system on those phonetic
00280 utterances that seemed to depart from the previously learned norm. In
00290 either case it is believed that some automatic or semi-automatic
00300 training system should be employed to acquire the data that is used
00310 for the identification of the phonetic information in the speech. We
00320 believe that this can best be done by employing a modification of the
00330 signature table scheme previously described by one of us.
00340
00345 The Overall System
00350
00360 The overall system is envisioned as one that uses the more or
00370 less conventional method of separating the input speech into
00380 short time slices for which some sort of frequency analysis,
00390 homomorphic, LPC, or the like, is done. We then interpret this
00400 information in terms of significant features by means of a set of
00410 signature tables. At this point we define longer sections of the
00420 speech called EVENTS which are obtained by grouping together varying
00430 numbers of the original slices on the basis of their similarity. This
00440 then takes the place of other forms of initial segmentation. Having
00450 identified a series of EVENTS in this way we next use another set of
00460 signature tables to extract information from the sequence of events
00470 and combine it with a limited amount of syntactic and semantic
00480 information to define a sequence of phonemes.
00490
00500
00510 Advantages of the Signature Table approach
00520
00530 Signature tables can be used to perform four essential
00540 functions that are required in the automatic recognition of speech.
00550 These functions are: (1) the elimination of superfluous and
00560 redundant information from the acoustic input stream, (2)
00570 the transformation of the remaining information from one coordinate
00580 system to a more phonetically meaningful coordinate system, (3) the
00590 mixing of acoustically derived data with syntactic, semantic and
00600 linguistic information to obtain the desired recognition, and (4) the
00610 introduction of a learning mechanism.
00620
00630 An early form of Signature Table
00640
00650 For those not familiar with the use of signature tables as
00660 used by Samuel in programs which played the game of checkers, the
00670 concept is best illustrated (Fig.1) by an arrangement of tables used
00680 in the program. There are 27 input terms. Each term evaluates a
00690 specific aspect of a board situation and it is quantized into a
00700 limited but adequate range of values, 7, 5, and 3 in this case. The
00710 terms are divided into 9 sets with 3 terms each, forming the 9 first
00720 level tables. Outputs from the first level tables are quantized to 5
00730 levels and combined into 3 second level tables and, finally, into one
00740 third-level table whose output represents the figure of merit of the
00750 board in question.
00760 A signature table has an entry for every possible combination
00770 of the input vector. Thus there are 7*5*3 or 105 entries in each of
00780 the first level tables. Training consists of accumulating two counts
00790 for each entry during a training sequence. Count A is incremented
00800 when the current input vector represents a preferred move and count D
00810 is incremented when it is not the preferred move. The output from the
00820 table is computed as a correlation coefficient
00830 C = (A-D)/(A+D). The figure of merit for a board
00840 is simply the coefficient obtained as the output from the final
00850 table.
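The counting scheme just described can be illustrated with a minimal sketch of a single first-level table. This is not the original checkers program: the class name, the method names and the choice to report 0 for an entry that was never seen in training are assumptions made only for the illustration.

```python
from itertools import product


class SignatureTable:
    """One table with an entry for every combination of quantized inputs."""

    def __init__(self, ranges):
        self.ranges = tuple(ranges)
        # Each entry holds two counts: [A, D].
        self.counts = {
            key: [0, 0] for key in product(*(range(r) for r in self.ranges))
        }

    def train(self, inputs, preferred):
        """Increment A for a preferred move, D otherwise."""
        entry = self.counts[tuple(inputs)]
        entry[0 if preferred else 1] += 1

    def output(self, inputs):
        """Return the coefficient C = (A-D)/(A+D) for this entry."""
        a, d = self.counts[tuple(inputs)]
        if a + d == 0:
            return 0.0  # assumption: an untrained entry reports 0
        return (a - d) / (a + d)


# Three inputs quantized to 7, 5 and 3 levels give 7*5*3 = 105 entries.
table = SignatureTable((7, 5, 3))
for _ in range(3):
    table.train((6, 2, 1), preferred=True)
table.train((6, 2, 1), preferred=False)

print(len(table.counts))        # 105
print(table.output((6, 2, 1)))  # (3-1)/(3+1) = 0.5
```

Because training only increments counts, samples can be presented one at a time and discarded, which is the second advantage listed in the text.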
00860
00870 The following three advantages emerge from this method of
00880 training and evaluation.
00890 1) Essentially arbitrary inter-relationships between the
00900 input terms are taken into account by any one table. The only loss of
00910 accuracy is in the quantization.
00920 2) The training is a very simple process of accumulating
00930 counts. The training samples are introduced sequentially, and hence
00940 simultaneous storage of all the samples is not required.
00950 3) The process linearizes the storage requirements in the
00960 parameter space. In the case shown this requires only 343 entries
00970 instead of the 105↑9 entries were the entire space to be represented.
00980
00990 The chief disadvantage of this simple form of table relates
01000 to the highly questionable practice of using the correlation
01010 coefficient outputs from some tables as inputs to other tables. This
01020 defect has been overcome in a recent form of table described
01030 elsewhere. The simple system still works remarkably well as will be
01040 seen by the results below.
01050
01055 Signature Tables for Speech Recognition
01060
01070 The signature tables, as used in speech recognition, must be
01080 particularized to allow for the multi-category nature of the output.
01090 Several forms of tables have been investigated. The initial form
01100 tested and used for the data to be presented in this paper uses
01110 tables consisting of two parts, a preamble and the table proper. The
01120 preamble contains: (1) space for saving a record of the current and
01130 recent output reports from the table, (2) identifying information as
01140 to the specific type of table, (3) a parameter that identifies the
01150 desired output from the table and that is used in the learning
01160 process, (4) a gating parameter specifying the input that is to be
01170 used to gate the table, (6) the gating level to be used and (7)
01180 parameters that identify the sources of the normal inputs to the
01190 table.
01200
01210 All inputs are limited in range and specify either the
01220 absolute level of some basic property or more usually the probability
01230 of some property being present. These inputs may be from the
01240 original acoustic input or they may be the outputs of other tables.
01250 If from other tables they may be for the current time step or for
01260 earlier time steps (subject to practical limits as to the number of
01270 time steps that are saved).
01280
01290 The output, or outputs, from each table are similarly limited
01300 in range and specify, in all cases, a probability that some
01310 particular significant feature, phonette, phoneme, word segment, word
01320 or phrase is present.
01330
01340 We are limiting the range of inputs and outputs to values
01350 specified by 3 bits and the number of entries per table to 64
01360 although this choice of values is a matter to be determined by
01370 experiment. We are also providing for any of the following input
01380 combinations, (1) one input of 6 bits, (2) two inputs of 3 bits each,
01390 (3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
01400 The uses to which these different forms are put will be described
01410 later.
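Each of the four input combinations uses exactly 6 bits in total, so each addresses one of the 64 entries. A minimal sketch of how such an address might be formed is given below; the function name and the packing order (first input in the most significant bits) are assumptions, not details taken from the program.

```python
# Number of inputs -> bits per input, for the four allowed combinations.
ALLOWED_LAYOUTS = {1: 6, 2: 3, 3: 2, 6: 1}


def table_index(inputs):
    """Pack the quantized inputs into a single entry number, 0..63."""
    bits = ALLOWED_LAYOUTS.get(len(inputs))
    if bits is None:
        raise ValueError("unsupported number of inputs: %d" % len(inputs))
    index = 0
    for value in inputs:
        if not 0 <= value < (1 << bits):
            raise ValueError("input %r does not fit in %d bits" % (value, bits))
        # Assumption: the first input occupies the most significant bits.
        index = (index << bits) | value
    return index


print(table_index([63]))                # 63
print(table_index([5, 2]))              # 42
print(table_index([1, 0, 3]))           # 19
print(table_index([1, 0, 1, 0, 1, 0]))  # 42
```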
01420
01430 The body of each table contains entries corresponding to
01440 every possible combination of the allowed input parameters. Each
01450 entry in the table actually consists of several parts. There are
01460 fields assigned to accumulate counts of the occurrences of incidents
01470 in which the specifying input values coincided with the different
01480 desired outputs from the table as found during previous learning
01490 sessions and there are fields containing the summarized results of
01500 these learning sessions, which are used as outputs from the table.
01510 The outputs from the tables can then express to the allowed accuracy
01520 all possible functions of the input parameters.
01530
01532 Operation in the Training Mode
01534
01540 When operating in the training mode the program is supplied
01550 with a sequence of stored utterances with accompanying phonetic
01560 transcriptions. Each segment of the incoming speech signal is
01570 analysed (Fourier transforms or inverse filter equivalent) to obtain
01580 the necessary input parameters for the lowest level tables in the
01590 signature table hierarchy. At the same time reference is made to a
01600 table of phonetic "hints" which prescribe the desired outputs from
01610 each table which correspond to all possible phonemic inputs. The
01620 signature tables are then processed.
01630
01640 The processing of each table is done in two steps, one
01650 process at each entry to the table and the second only periodically.
01660 The first process consists of locating a single entry line within the
01670 table as specified by the inputs to the table and adding a 1 to the
01680 appropriate field to indicate the presence of the property specified
01690 by the hint table as corresponding to the phoneme specified in the
01700 phonemic transcription. At this time a report is also made as to the
01710 table's output as determined from the averaged results of previous
01720 learning so that a running record may be kept of the performance of
01730 the system. At periodic intervals all tables are updated to
01740 incorporate recent learning results. To make this process easily
01750 understandable, let us restrict our attention to a table used to
01760 identify a single significant feature, say Voicing. The hint table
01770 will identify whether or not the phoneme currently being processed is
01780 to be considered voiced. If it is voiced, a 1 is added to the "yes"
01790 field of the entry line located by the normal inputs to the table. If
01800 it is not voiced, a 1 is added to the "no" field. At updating time
01810 the output that this entry will subsequently report is determined by
01820 dividing the accumulated sum in the "yes" field by the sum of the
01830 numbers in the "yes" and the "no" fields, and reporting this quantity
01840 as a number in the range from 0 to 7. Actually the process is a bit
01850 more complicated than this and it varies with the exact type of table
01860 under consideration, as reported in detail in appendix B. Outputs
01870 from the signature tables are not probabilities, in the strict sense,
01880 but are the statistically-arrived-at odds based on the actual
01890 learning sequence.
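The two-step processing of a single-feature table such as Voicing can be sketched as follows. This covers only the simplified case described in the text, not the fuller procedure of appendix B. The class and method names are hypothetical, and rounding to the nearest integer is an assumption, since the text states only that the ratio is reported as a number from 0 to 7.

```python
class FeatureTable:
    """A 64-entry table for one feature, with "yes" and "no" count fields."""

    def __init__(self, n_entries=64):
        self.yes = [0] * n_entries
        self.no = [0] * n_entries
        self.out = [0] * n_entries  # summarized results of earlier learning

    def observe(self, index, hint_is_yes):
        """Step one, done at every entry to the table."""
        report = self.out[index]  # output based on previous learning
        if hint_is_yes:
            self.yes[index] += 1
        else:
            self.no[index] += 1
        return report

    def update(self):
        """Step two, done only periodically."""
        for i, (y, n) in enumerate(zip(self.yes, self.no)):
            if y + n:
                # Assumption: scale yes/(yes+no) to 0..7 and round to nearest.
                self.out[i] = int(7 * y / (y + n) + 0.5)


voicing = FeatureTable()
for hint in (True, True, True, False):
    voicing.observe(10, hint)  # reports 0 until the first update
voicing.update()
print(voicing.out[10])  # 7 * 3/4 = 5.25, reported as 5
```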
01900
01910 The preamble of the table has space for storing twelve past
01920 outputs. An input to a table can be delayed to that extent. This table
01930 relates outcomes of previous events with the present hint, the
01940 learning input. A certain amount of context dependent learning is thus
01950 possible with the limitation that the specified delays are constant.
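The record of past outputs kept in the preamble behaves like a fixed-length history from which a delayed input is read at a constant offset. A minimal sketch, assuming that a delay of 1 means the output of the immediately preceding time step and that the history starts filled with zeros (both are assumptions; the class name is hypothetical):

```python
from collections import deque

HISTORY = 12  # the preamble stores twelve past outputs


class OutputHistory:
    """Fixed-length record of a table's most recent outputs."""

    def __init__(self, fill=0):
        self._past = deque([fill] * HISTORY, maxlen=HISTORY)

    def record(self, output):
        """Store the output of the time step just completed."""
        self._past.appendleft(output)  # the oldest output drops off the end

    def delayed(self, delay):
        """Return the output recorded `delay` time steps ago (1..12)."""
        if not 1 <= delay <= HISTORY:
            raise ValueError("delay must be between 1 and %d" % HISTORY)
        return self._past[delay - 1]


history = OutputHistory()
for step_output in range(1, 16):  # outputs 1, 2, ..., 15
    history.record(step_output)

print(history.delayed(1))   # 15, the most recent output
print(history.delayed(12))  # 4, the oldest output still stored
```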
01960
01970 The interconnected hierarchy of tables forms a network which
01980 runs incrementally, in steps synchronous with the time window over which
01990 the input signal is analysed. The present window width is set at 12.8
02000 ms. (256 points at 20 K samples/sec.) with overlap of 6.4 ms. Inputs
02010 to this network are the parameters abstracted from the frequency
02020 analyses of the signal, and the specified hint. The outputs of the
02030 network could be either the probability attached to every phonetic
02040 symbol or the output of a table associated with a feature such as
02050 voiced, vowel, etc. The point to be made is that the output generated
02060 for a segment is essentially independent of its contiguous
02070 segments. The dependency achieved by using delays in the inputs is
02080 invisible to the outputs. The outputs thus report the best estimate of
02090 what the current acoustic input is with no relation to the past
02100 outputs. Relating the successive outputs along the time dimension is
02110 realised by counters.
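The window arithmetic can be checked with a short sketch. At 20,000 samples per second a 256-point window lasts 12.8 ms, and an overlap of 6.4 ms corresponds to advancing 128 samples per step. The function name and the decision to drop a final partial window are assumptions made for the illustration.

```python
SAMPLE_RATE = 20_000  # samples per second
WINDOW = 256          # points per analysis window
HOP = 128             # advance per step, leaving 128 points of overlap


def windows(signal):
    """Yield (start time in seconds, samples) for each full window."""
    for start in range(0, len(signal) - WINDOW + 1, HOP):
        yield start / SAMPLE_RATE, signal[start:start + WINDOW]


window_ms = WINDOW * 1000 / SAMPLE_RATE
overlap_ms = (WINDOW - HOP) * 1000 / SAMPLE_RATE
print(window_ms, overlap_ms)  # 12.8 6.4

segments = list(windows([0.0] * 1024))
print(len(segments))  # 7 full windows fit in 1024 samples
```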
02120
02122 The Use of COUNTERS
02124
02126 The transition from initial segment space to event space is
02128 made possible by means of COUNTERS which are summed and reinitiated
02129 whenever their inputs cross specified threshold values, being
02131 triggered on when the input exceeds the threshold and off when it
02133 falls below. Momentary spikes are eliminated by specifying time
02170 hysteresis, the number of consecutive segments for which the input
02180 must be above the threshold. The output of a counter provides
02190 information about starting time, duration and average input for the
02200 period it was active.
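A minimal sketch of one counter follows, with times measured in segment numbers. The function name and the form of the report are assumptions, and the sketch applies the hysteresis after the fact by discarding runs that were too short, which may differ from how the program applies it while running.

```python
def run_counter(inputs, threshold, hysteresis):
    """Return (start, duration, average input) for each active period."""
    reports = []
    run = []      # inputs seen during the current above-threshold run
    start = None

    def close_run():
        # Runs shorter than the hysteresis count as momentary spikes.
        if len(run) >= hysteresis:
            reports.append((start, len(run), sum(run) / len(run)))

    for t, value in enumerate(inputs):
        if value > threshold:
            if not run:
                start = t
            run.append(value)
        else:
            close_run()
            run = []
    close_run()  # a run may still be active when the input ends
    return reports


# A one-segment spike at t=1 is ignored; the run at t=3..6 is reported.
print(run_counter([0, 6, 0, 5, 6, 7, 6, 1, 0], threshold=4, hysteresis=3))
# [(3, 4, 6.0)]
```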
02210
02220 Since a counter can reference a table at any level in the
02230 hierarchy of tables, it can reflect any desired degree of information
02240 reduction. For example, a counter may be set up to show a section of
02250 speech to be a vowel, a front vowel or the vowel /I/. The counters can
02260 be looked upon as representing a mapping of parameter-time space into a
02270 feature-time space, or at a higher level symbol-time space. It may be
02280 useful to carry along the feature information as a back up in those
02290 situations where the symbolic information is not acceptable to
02300 syntactic or semantic interpretation.
02310
02320 In the same manner as the tables, the counters run completely
02330 independently of each other. In a recognition run the counters may
02340 overlap in arbitrary fashion, may leave gaps where no counter has
02350 been triggered or may not line up nicely. A properly segmented output,
02360 where the consecutive sections are in time sequence and are neatly
02370 labeled, is essential for further processing. This is achieved by
02380 registering the instants when the counters are triggered or
02390 terminated to form time segments called events.
02400
02410 An event is the period between successive activation or
02420 termination of any counter. An event shorter than a specified time is
02430 merely ignored. A record of event durations and up to three active
02440 counters, ordered according to their probability, is maintained.
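The construction of events from counter activity can be sketched as below. The report format (name, start, duration, average input), the use of the average input as the ordering "probability", and the retention of gap events with no active counter are all assumptions made for the illustration.

```python
def make_events(reports, min_duration=2):
    """Cut time at every counter start or end and label each piece."""
    boundaries = sorted(
        {t for _, start, dur, _ in reports for t in (start, start + dur)}
    )
    events = []
    for begin, end in zip(boundaries, boundaries[1:]):
        if end - begin < min_duration:
            continue  # events shorter than the specified time are ignored
        active = [
            (avg, name)
            for name, start, dur, avg in reports
            if start <= begin and end <= start + dur
        ]
        active.sort(reverse=True)  # highest average input first
        events.append((begin, end - begin, [name for _, name in active[:3]]))
    return events


reports = [
    ("vowel", 0, 10, 6.0),
    ("front_vowel", 2, 6, 5.0),
    ("nasal", 9, 5, 4.0),
]
for event in make_events(reports):
    print(event)
# (0, 2, ['vowel'])
# (2, 6, ['vowel', 'front_vowel'])
# (10, 4, ['nasal'])
```

The pieces from 8 to 9 and from 9 to 10 last only one segment each and are dropped.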
02450
02460 An event resulting from the processing described so far
02470 represents a phonette - one of the basic speech categories defined
02480 as hints in the learning process. It is only an estimate of closeness
02490 to a speech category, based on past learning. Also each category has
02500 a more-or-less stationary spectral characterisation. Thus a category
02510 may have a phonemic equivalent as in the case of vowels, it may be
02520 common to a phoneme class as for the voiced or unvoiced stop gaps or it
02530 may be subphonemic as a T-burst or a K-burst. The choices are
02540 based on acoustic expediency, i.e. optimisation of the learning
02550 rather than any linguistic considerations. However, higher level
02560 interpretive programs may best operate on inputs resembling phonemic
02570 transcription. The contiguous events may be coalesced into phoneme-like
02580 units using dyadic or triadic probabilities and acoustic-phonetic
02590 rules particular to the system. For example, a period of silence
02600 followed by a type of burst or a short friction may be combined to
02610 form the corresponding stop. A short friction or a burst following a
02620 nasal or a lateral may be called a stop even if the silence period is
02630 short or absent. Clearly these rules must be specific to the system,
02640 based on the confidence with which durations and phonette categories
02650 are recognised.
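The two example rules can be written down as a small sketch. It covers only the rule-based part and omits the dyadic or triadic probabilities. The phonette labels, the friction duration limit and the generic "STOP" label for a friction release are hypothetical, chosen only to make the example concrete.

```python
BURSTS = {"T_BURST": "T", "K_BURST": "K"}  # hypothetical phonette labels
SHORT_FRICTION = 3  # hypothetical limit, in segments


def is_release(label, duration):
    """A burst, or a friction short enough to count as a stop release."""
    return label in BURSTS or (label == "FRIC" and duration <= SHORT_FRICTION)


def coalesce(events):
    """Combine (label, duration) events into phoneme-like units."""
    units = []
    i = 0
    while i < len(events):
        label, dur = events[i]
        nxt = events[i + 1] if i + 1 < len(events) else None
        if label == "SIL" and nxt is not None and is_release(*nxt):
            # Silence followed by a burst or short friction forms a stop.
            units.append((BURSTS.get(nxt[0], "STOP"), dur + nxt[1]))
            i += 2
        elif is_release(label, dur) and units and units[-1][0] in ("NASAL", "LATERAL"):
            # After a nasal or lateral the silence may be absent.
            units.append((BURSTS.get(label, "STOP"), dur))
            i += 1
        else:
            units.append((label, dur))
            i += 1
    return units


events = [("AA", 10), ("SIL", 6), ("T_BURST", 2),
          ("NASAL", 8), ("K_BURST", 2), ("FRIC", 9)]
print(coalesce(events))
# [('AA', 10), ('T', 8), ('NASAL', 8), ('K', 2), ('FRIC', 9)]
```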
02660
02670 While it would be possible to extend this bottom up approach
02680 still further, it seems reasonable to break off at this point and
02690 revert to a top down approach from here on. The real difference in
02700 the overall system would then be that the top down analysis would
02710 deal with the outputs from the signature table section as its
02720 primitives rather than with the outputs from the initial measurements
02730 either in the time domain or in the frequency domain. In the case
02740 of inconsistencies the system could either refer to the second choices
02750 retained within the signature tables or if need be could always go
02760 clear back to the input parameters. The decision as to how far to
02770 carry the initial bottom up analysis must depend upon the relative
02780 cost of this analysis both in complexity and processing time and
02790 the certainty with which it can be performed as compared with the
02800 costs associated with the rest of the analysis and the certainty
02810 with which it can be performed, taking due notice of the costs in
02820 time of recovering from false starts.
02830